---
title: "Jamie's wine model"
output: html_notebook
---

Homework review on Monday

```{r}
library(tidyverse)
library(GGally)
library(modelr)

red_wine <- read_csv("data/wine_quality_red.csv")
white_wine <- read_csv("data/wine_quality_white.csv")
```

# set up method

* what goodness of fit measure? adj R2, AIC/BIC
* test-train split? yes; 90:10 (a ~160-row test set is OK even for the smallest dataset, red wine)
* K-fold CV? yes - to prevent overfitting
* model? or models? red/white separately

try white model first

```{r}
# set up test-train split first
set.seed(42)  # make the split reproducible
n_data_white <- nrow(white_wine)
test_prop <- 0.1

test_index_white <- sample(1:n_data_white, size = floor(n_data_white * test_prop))
test_white <- slice(white_wine, test_index_white)
train_white <- slice(white_wine, -test_index_white)
```

Note: any transformations we apply to our training set from now on, we can recapitulate in our predictive model formula later for our test data.
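For example, a sketch of recapitulating training-set transformations on the test data (column names match the feature-engineering chunk below; the key point is to reuse *training-set* statistics, not the test set's own):

```{r}
# Sketch: rebuild engineered features on test data using training-set statistics
chlorides_train_mean <- mean(train_white$chlorides)

test_white_fe <- test_white %>% 
  mutate(log_volatile_acidity = log(1 + volatile_acidity),
         chlorides_diff_to_mean = abs(chlorides - chlorides_train_mean))
```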

# explore the data

* what does 1 row represent? 1 row = 1 wine, quality score is an average
* what vars do we have? physicochemical properties of the wine, plus region
* what transformations might make sense? some skewed distributions might need transforming
* what influences quality? don't know yet, will need to explore correlations; data dictionary has some interesting info

```{r}
skimr::skim(train_white) %>% view()
```

```{r}
train_white %>% 
  select(quality, 2:7) %>% 
  ggpairs(progress = F)
```

Can see slight negative correlation in fixed_acidity.

Looking at chlorides, maybe the middle amount is best: could transform to abs(diff_from_mean) or (diff_from_mean)^2 to frame it as extreme vs middle values. Could also make them into z-scores ((x - mu) / sd).
Maybe a similar situation for citric_acid.


```{r}
train_white %>% 
  select(quality, 8:14) %>% 
  ggpairs(progress = F)
```

Could believe alcohol is positively correlated, also knowing that alcohol influences beer quality scores (prior domain knowledge). The other variables don't look correlated; maybe some extreme values are having an influence.

Region doesn't look like it has any influence, not worth including.

Seeing correlations between predictors, e.g. alcohol and density - but Jamie wouldn't chuck a variable out because of this: could use an interaction, or decide not to include density if we already have alcohol, etc. Better to include it so we keep this choice.

So we have: alcohol (+ve, also correlated with density), fixed_acidity (-ve) and maybe a transformed measure of chlorides and citric_acid.

## transform

```{r}
train_white_fe <- train_white %>% 
  # log transform
  mutate(across(.cols = where(is.numeric),
                .fns = ~ log(1 + .x),
                .names = "log_{.col}")) %>% # this is how to create new cols instead of overwriting
  # transform the extreme value hump data
  mutate(chlorides_diff_to_mean = abs(chlorides - mean(chlorides)), .after = chlorides) %>%  # check skim, see mean ~= median
  mutate(citric_acid_diff_to_mean = abs(citric_acid - mean(citric_acid)), .after = citric_acid)

head(train_white_fe)
```

```{r}
train_white_fe %>% 
  ggplot(aes(x = chlorides_diff_to_mean, y = quality)) +
  geom_point() +
  geom_smooth(method = lm)

train_white_fe %>% 
  ggplot(aes(x = citric_acid_diff_to_mean, y = quality)) +
  geom_point() +
  geom_smooth(method = lm)
```

Slight negative correlation for chlorides, more negative correlation for citric_acid.

Not sure this is working.

Need to have a look again and put quality on the y-axis - in ggpairs, put quality as the _last_ var not the first one:

```{r}
train_white_fe %>% 
  ggplot(aes(x = chlorides, y = quality)) +
  geom_point()

train_white_fe %>% 
  ggplot(aes(x = citric_acid, y = quality)) +
  geom_point()
```

Now we see that chlorides doesn't have the shape where high/low quality corresponds to high/low chlorides. Citric acid still seems to have this shape though; might be worth keeping.

Don't use the chloride transformation as we move forward. Do consider citric_acid_diff_to_mean.

Look at correlations with the new log_ vars

```{r}
train_white_fe %>% 
  select(quality, starts_with("log_")) %>% 
  select(1:8) %>% 
  select(starts_with("log_"), quality) %>%  # put quality last so it's on y-axis vs others
  ggpairs(progress = F)
```

Maybe also log_volatile_acidity and log_chlorides (both -ve)

```{r}
train_white_fe %>% 
  select(quality, starts_with("log_")) %>% 
  select(1, 9:14) %>% 
  select(starts_with("log_"), quality) %>%  # put quality last so it's on y-axis vs others
  ggpairs(progress = F)
```

log_alcohol is strong (but we already have alcohol); the log_density correlation looks like it might be stretched by some extreme values - ignore it (but maybe keep the interaction with alcohol, still present with the log values); log_total_sulfur_dioxide might also have this hump shape that we could transform too.

```{r}
train_white_fe <- train_white_fe %>% 
  mutate(diff_log_total_sulfur_dioxide = abs(log_total_sulfur_dioxide - mean(log_total_sulfur_dioxide)), .after = log_total_sulfur_dioxide)

train_white_fe %>% 
  ggplot(aes(x = diff_log_total_sulfur_dioxide, y = quality)) + 
  geom_point() +
  geom_smooth(method = lm)
```

There might be something here too.

Features of interest: 

* alcohol (+ve, also correlated with density)
* fixed_acidity (-ve)
* citric_acid_diff_to_mean (-ve)
* diff_log_total_sulfur_dioxide
* log_volatile_acidity
* log_chlorides

# start modelling

forward stepwise

```{r}
mod1 <- lm(quality ~ alcohol, train_white_fe)
summary(mod1)
```

```{r}
# check remaining residuals
train_white_fe %>% 
  add_predictions(mod1) %>% 
  add_residuals(mod1) %>% 
  select(2:8, resid) %>% # for first 7 vars
  ggpairs(progress = F)
```

```{r}
train_white_fe %>% 
  add_predictions(mod1) %>% 
  add_residuals(mod1) %>% 
  select(9:15, resid) %>% # next 7 vars
  ggpairs(progress = F)
```


```{r}
train_white_fe %>% 
  add_predictions(mod1) %>% 
  add_residuals(mod1) %>% 
  select(16:23, resid) %>% # next 8 vars
  ggpairs(progress = F)
```


```{r}
train_white_fe %>% 
  add_predictions(mod1) %>% 
  add_residuals(mod1) %>% 
  select(24:30, resid) %>% # last 7 vars
  ggpairs(progress = F)
```


Maybe volatile acidity (log and untransformed) is a good next one; maybe also log_free_sulfur_dioxide, log_total_sulfur_dioxide

## add volatile acidity

```{r}
mod2 <- lm(quality ~ alcohol + volatile_acidity, train_white_fe)
```


```{r}
library(ggfortify)
autoplot(mod2)
```

Looks good for the assumptions of linear regression.


```{r}
summary(mod2)
```

Adj R2 is 0.2241 (vs 0.1754 for mod 1) so looking like a better model here
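Our method also named AIC/BIC as goodness-of-fit measures; a quick sketch of comparing the two models on those (lower is better):

```{r}
# Compare the candidate models on AIC and BIC as well (lower is better)
AIC(mod1, mod2)
BIC(mod1, mod2)
```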

```{r}
anova(mod1, mod2)
```

Yes: it's a significantly different model and reduces the RSS, so it's an improvement.

## Validity of including aliases / overlapping variables

```{r}
mod3 <- lm(quality ~ alcohol + volatile_acidity + log_volatile_acidity, train_white_fe)

summary(mod3)
```

Including the log_ one makes the ordinary one non-significant, and even flips the sign of its coefficient. Don't want to include both of these.

Could try mod2 but with the log_ one instead, and compare to the original mod2. Overall, it's best to fit as few models as possible, so this isn't great (given stepwise's flaws).

```{r}
mod2b <- lm(quality ~ alcohol + log_volatile_acidity, data = train_white_fe)

summary(mod2)
summary(mod2b)
```

Mod2b is slightly better than mod2, not by much!

Remember also that log_ variables make the model harder to communicate, but we can find ways to translate for our audience: "X influences Y"; "X increases Y"; "X increases Y, in that when you multiply X by 10, Y increases by 1"; or "volatile_acidity has a log-normal distribution" (if it's true that the log_ transform normalises the distribution).
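As a rough illustration (hypothetical numbers; note mod2b uses a natural log on 1 + x, not log10): for a natural-log predictor with coefficient b, multiplying the (shifted) predictor by 10 changes the prediction by b * log(10):

```{r}
# Hypothetical illustration of interpreting a natural-log coefficient
b <- -2.77   # approx. log_volatile_acidity coefficient from mod2b
b * log(10)  # change in predicted quality if (1 + volatile_acidity) were multiplied by 10
```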

A better approach might be to look at correlations in ggpairs() and pick the more closely correlated ones.

In terms of truly aliased variables (like a, b and c where a + b = c), we do need to remove aliases as a hygiene factor before modelling.
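A quick sketch of spotting a true alias with base R's alias() (toy data; the names a, b, c are hypothetical):

```{r}
# Toy example: c = a + b exactly, so lm() cannot estimate all three coefficients
aliased_demo <- tibble(a = 1:10, b = (1:10)^2) %>% 
  mutate(c = a + b, y = rnorm(10))
alias(lm(y ~ a + b + c, data = aliased_demo))  # reports c as a linear combination of a and b
```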

## CV

Can do this multiple times - with non-transformed vars, with log_-transformed vars, with a subset of ~4 vars (e.g. from a sensible guess or from this initial stepwise exploration) - then compare these output models to see which is best.

If you do the test/train split first, you have an unused 10% test set that you can then use to test the final models and understand predictive accuracy.

So, will hit this more later (including "hyperparameter tuning") but consider:

* test/train split first - need a leftover test set (validation set) to tell us how good our trained model was, which of our trained models to use (which hyperparameters were best - the setup of our model process)
* do some exploration first to understand vars, even with some stepwise regression to get to which small subset of vars might be useful to try alongside all (outcome ~ .) for transformed and untransformed vars
* use K-fold CV to fine-tune against the success measures; have a few model options to try when predicting the outcome for the test data
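A minimal sketch of K-fold CV using modelr (assuming train_white_fe and the mod2b formula from above; the r4ds crossv_kfold pattern):

```{r}
# Sketch: 10-fold CV for one candidate model, scored by hold-out RMSE
cv_folds <- crossv_kfold(train_white_fe, k = 10)

cv_scores <- cv_folds %>% 
  mutate(model = map(train, ~ lm(quality ~ alcohol + log_volatile_acidity, data = .x)),
         fold_rmse = map2_dbl(model, test, ~ rmse(.x, .y)))

mean(cv_scores$fold_rmse)  # average hold-out error across folds
```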

### hyperparameters

> Basically, anything in machine learning and deep learning that you decide their values or choose their configuration before training begins and whose values or configuration will remain the same when training ends is a hyperparameter. Here are some common examples:
>
> - Train-test split ratio
> - Learning rate in optimization algorithms (e.g. gradient descent)
> - Choice of optimization algorithm (e.g., gradient descent, stochastic gradient descent, or Adam optimizer)

source: https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac





